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Abstract.  We  develop  an  adaptive  Hessian-based  non-stationary  Gaussian  process  response 
surface  method  to  approximate  a  probability  density  function  (pdf)  that  exploits  its  structure,  in 
particular  the  Hessian  of  its  negative  logarithm.  Of  particular  interest  to  us  are  pdfs  that  arise  from 
the  Bayesian  solution  of  large-scale  inverse  problems,  which  imply  very  expensive-to-evaluate  pdfs. 
The  method  can  be  considered  as  a  piecewise  adaptive  Gaussian  approximation  in  which  a  Gaussian 
tailored  to  the  local  Hessian  of  the  negative  log  probability  density  is  constructed  for  each  sub- 
region  in  high  dimensional  parameter  space.  The  task  of  efficiently  partitioning  the  parameter  space 
into  sub-regions  is  done  implicitly  through  Hessian-informed  membership  probability  functions.  The 
Gaussian  process  machinery  is  then  employed  to  glue  all  local  Gaussian  approximations  into  a  global 
analytical  response  surface  that  is  far  cheaper  to  evaluate  than  the  original  expensive  probability 
density.  The  resulting  response  surface  is  also  equipped  with  an  analytical  variance  estimate  that  can 
be  used  to  assess  the  uncertainty  of  the  approximation.  One  of  the  key  components  of  our  proposed 
approach  is  an  adaptive  sampling  strategy  for  exploring  the  parameter  space  efficiently  during  the 
computer  experimental  design  step,  which  aims  to  find  training  points  with  high  probability  density. 
The  detailed  construction  and  an  analysis  of  the  method  are  presented.  We  then  demonstrate  the 
accuracy  and  efficiency  of  the  proposed  method  on  several  example  problems,  including  inverse  shape 
electromagnetic  scattering  in  24-dimensional  parameter  space. 

Key  words,  probability  density  approximation;  Gaussian  process;  response  surface;  adaptive 
sampling;  computer  experimental  design;  non-stationary;  curse  of  dimensionality;  Bayesian  inversion; 
covariance  function;  membership  probability;  adjoint;  Hessian. 

AMS  subject  classifications.  62G07,  62G08,  62K20 

1.  Introduction.  Solving  large-scale  ill-posed  inverse  problems  that  are  gov¬ 
erned  by  partial  differential  equations  (PDEs)  is  both  of  great  practical  importance 
in  science  and  industry  as  well  as  tremendously  challenging.  Classical  deterministic 
inverse  methodologies,  which  provide  point  estimates  of  the  solution,  are  not  capable 
of  rigorously  accounting  for  the  uncertainty  in  the  inverse  solution.  The  Bayesian 
formulation  provides  a  systematic  quantification  of  uncertainty  by  posing  the  inverse 
problem  as  one  of  statistical  inference.  The  Bayesian  framework  for  inverse  problems 
proceeds  as  follows:  given  observational  data  and  their  uncertainty,  the  governing 
forward  problem  and  its  uncertainty,  and  a  prior  probability  density  function  (pdf) 
describing  uncertainty  in  the  parameters  m  E  M^,  the  solution  of  the  inverse  prob¬ 
lems  is  the  posterior  probability  distribution  7rp0st(m)  over  the  parameters.  Bayes’ 
Theorem  explicitly  gives  the  posterior  pdf  as 

7rpost(m|y0bs)  OC  ^Prior(m)^like(yobs|m)5 

which  combines  the  prior  pdf  7rprior(m)  and  the  likelihood  (yQbs  | m)  •  The  prior 

encodes  any  knowledge  or  assumptions  about  the  parameter  space  that  we  may  wish 
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to  impose  before  any  data  are  considered,  while  the  likelihood  7riike(yobs|m)  explicitly 
represents  the  probability  that  a  given  set  of  parameters  m  might  give  rise  to  the 
observed  data  yQbs  G  Mp.  For  simplicity  of  exposition,  we  assume  that  the  prior  is 
Gaussian  and  that  the  measurement  and  PDE  model  errors  are  combined  into  a  noise 
term  e  =  yQbs  —  f(m),  which  is  additive  and  i.i.d.  Gaussian.  Then  the  pdf’s  for  the 
prior  and  likelihood  can  be  written  in  the  form 

7rprior(in)  CK  exp  ^  mprior)  ■^'prior(m  mprior)^  •> 

7Tiike(e)  oc  exp  i(e  -  e)Tr-01ise(e  -  e)^  , 

respectively,  where  mprior  is  the  mean  of  the  prior  distribution,  e  the  mean  of  the 
Gaussian  noise,  rprior  G  MiVxAr  the  covariance  matrix  for  the  prior,  and  rnoise  G  Wxp 
the  covariance  matrix  of  the  noise.  Restating  Bayes’  theorem  with  these  Gaussian 
pdf’s,  we  find  that 

7Tpost(m)  oc  exp  (  -  hm  -  mpriorllr-1  -  hyobs  -  f(m)  -  e|£_ i  Y  (1.1) 

where  f(m)  is  the  (nonlinear)  operator  mapping  parameters  to  observations.  Note 
that  the  seemingly  simple  expression  f(m)  belies  the  complexity  of  the  underlying 
computations,  which  involves:  (i)  creation  of  the  PDE  model  for  given  parameters 
m;  (ii)  solution  of  the  governing  PDEs  to  yield  the  output  state  variables;  and  (iii) 
extraction  of  the  observables  (i.e.,  the  values  of  the  states  at  the  observation  locations 
in  space  and  time).  In  general,  f(m)  is  nonlinear,  even  when  the  forward  PDEs  are 
linear  in  the  state  variables,  since  the  parameters  couple  with  the  states  nonlinear ly 
in  the  forward  PDEs. 

As  is  clear  from  the  expression  (1.1),  despite  the  choice  of  prior  and  noise  prob¬ 
ability  distributions  as  Gaussian,  the  posterior  probability  distribution  need  not  be 
Gaussian,  due  to  the  nonlinearity  of  f(m).  The  non-Gaussianity  of  the  posterior  poses 
challenges  for  computing  statistics  for  typical  large-scale  inverse  problems  since  7rpost 
is  often  a  surface  in  high  (thousands  or  millions)  dimensions,  and  evaluating  each 
point  on  this  surface  requires  a  solution  of  the  forward  PDEs.  Numerical  quadrature 
to  compute  the  mean  and  covariance  matrix,  for  example,  is  out  of  the  question. 
Usually,  the  method  of  choice  for  computing  statistics  is  Markov  chain  Monte  Carlo 
(MCMC)  [28],  which  judiciously  samples  the  posterior  distribution,  so  that  sample 
statistics  can  be  used  to  approximate  the  exact  ones.  But  the  use  of  MCMC  for 
large-scale  inverse  problems  is  still  prohibitive  for  expensive  forward  problems  and 
high  dimensional  parameter  spaces,  since  even  for  modest  numbers  of  parameters,  the 
number  of  samples  required  can  be  in  the  thousands  or  millions.  Nevertheless,  MCMC 
can  be  more  efficient  by  exploiting  higher  order  information  such  as  the  Hessian  [44]. 

Since  solving  the  forward  PDEs  is  the  most  expensive  component  of  evaluating  the 
posterior  pdf,  one  can  employ  model  reduction  techniques  to  construct  inexpensive- 
to-solve  reduced-order  models  of  the  PDEs  [8,24,34,40,49,65].  On  the  other  hand,  one 
can  reduce  the  cost  of  evaluating  the  likelihood  directly  using  polynomial  chaos  [45,46] . 
One  can  also  pose  the  task  of  approximating  the  Bayesian  solution  as  a  density  esti¬ 
mation  problem,  for  which  there  is  a  vast  literature,  including  classical  density  estima¬ 
tion,  multi-dimensional  kernel  density  approximation,  and  mixture  density  estimation; 
see  [61,62]  and  references  therein.  Finally,  one  can  reduce  the  cost  of  evaluating  the 
parameter-to-observable  map  f(m)  by  approximating  this  so-called  response  surface 
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using  such  techniques  as  metamodels  or  radial  basis  functions  [47, 64] ,  and  Gaussian 
process  models  [30,31,36,53].  The  majority  of  these  methods  do  not  exploit  derivative 
(of  the  parameter-to-observable  map)  information,  which  is  our  goal  here. 

Here,  we  choose  to  directly  approximate  the  posterior  using  a  Hessian-based  Gaus¬ 
sian  process  response  surface.  This  results  in  an  inexpensive-to-evaluate  explicit  re¬ 
sponse  surface  equipped  with  an  analytical  uncertainty  estimate.  Thus,  this  “surro¬ 
gate  posterior  density”  can  be  sampled,  using  MCMC  for  example,  at  negligible  cost 
compared  to  sampling  the  original  posterior  density.  The  task  of  solving  a  statistical 
inverse  problem  therefore  reduces  to  approximating  a  function  over  high  dimensional 
parameter  space,  for  which  one  has  to  face  the  curse  of  dimensionality. 

The  term  “curse  of  dimensionality”  was  coined  by  Bellman  [5]  in  the  context  of 
optimization  to  reflect  the  fact  that  in  order  to  obtain  an  accurate  minimizer  within 
6  tolerance,  an  exponential  number  of  function  evaluations,  i.e.,  Q)  ,  is  required  if 
our  knowledge  about  the  cost  function  is  limited,  for  example,  to  Lipschitz  continuity. 
A  similar  curse  of  dimensionality  in  function  approximation  says  that  an  exponential 
number  of  function  evaluations,  i.e.,  Q)  ,  is  necessary  for  the  approximation  to  be 
uniformly  accurate  within  5  tolerance  if  Lipschitz  continuity  is  all  we  know  about  the 
approximated  function  [20]. 

In  the  context  of  statistics,  the  curse  of  dimensionality  reflects  the  fact  that  high 
dimensional  spaces  are  very  sparse  [61].  For  example,  the  ratio  of  the  volume  of  the 
inscribed  hypersphere  and  that  of  the  corresponding  hypercube  converges  to  zero  as 
N  approaches  infinity.  Another  example  is  that  the  volume  of  a  thin  shell  between 
hyperspheres  of  radii  r  and  r  —  e  converges  to  the  volume  of  the  hypersphere  of  radius 
r  as  TV  approaches  infinity  no  matter  how  small  e  is.  These  two  examples  show  that 
the  volume  content  of  hypercubes  and  hyperspheres  concentrates  near  their  surfaces. 
That  is,  the  center  of  these  objects  is  more  or  less  empty.  A  concrete  example  of 
the  sparsity  in  high  dimensional  space  is  the  hypercube  [—1,  l]10  whose  first  quadrant 
contains  only  the  fraction  2“ 10  ( 2~N  for  N-dimensional  space)  of  uniform  sampling 
data.  Furthermore,  almost  no  samples  can  be  found  in  the  inscribed  hypersphere. 

The  problem  of  approximating  a  pdf  in  high  dimensions  by  sample  points  is  a  good 
example  of  this  effect.  Since  the  integral  of  a  bona-fide  pdf  over  the  domain  of  interest 
is  at  most  unity,  the  pdf  must  be  negligible  everywhere  except  in  the  neighborhood 
of  the  modes.  In  addition,  if  the  modes  are  located  away  from  the  boundaries  of  the 
domain  of  interest  (which  is  true  for  most  practical  applications  in  which  we  choose 
the  domain  of  interest  to  be  sufficiently  large  to  contain  all  the  important  features  of 
the  problem  under  consideration),  random  sampling  methods  (especially  space- filling 
techniques)  will  tend  to  fail  to  find  the  high  probability  regions,  since  almost  no 
samples  will  be  in  the  neighborhood  of  these  regions.  In  other  words,  the  pdf  at  the 
sampling  points  will  most  likely  be  close  to  zero,  and  hence  any  reasonable  estimation 
or  interpolation  methods  based  on  these  values  will  yield  flat  response  surfaces  whose 
values  are  close  to  zero. 

Nevertheless,  the  curse  of  dimensionality  is  not  entirely  a  pessimistic  result.  In 
fact,  it  implies  that  one  might  be  able  to  reduce  its  impact  if  higher  order  information, 
for  example,  gradients  and  Hessians  of  the  pdf,  is  exploited.  This  has  indeed  been 
the  case  for  optimization  of  systems  governed  by  PDEs  (i.e.,  PDE-constrained  opti¬ 
mization),  where  the  combination  of  (Hessian-based)  inexact  Newton  methods  with 
appropriate  preconditioners  yields  methods  that  can  deliver  solutions  at  the  cost  of 
a  constant  number  of  forward  PDE  solves,  independent  of  the  dimension  of  the  opti¬ 
mization  variable  space  (e.g.,  [6,7,21,33]).  That  is,  using  a  suitable  class  of  Newton 
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methods  for  optimization  and  under  favourable  conditions,  the  curse  of  dimensionality 
in  optimization  can  be  mitigated,  at  least  for  locating  local  minima. 

A  natural  idea  is  therefore  to  cast  the  density  approximation  problem  as  an  op¬ 
timization  problem  for  which  the  effect  of  the  curse  of  dimensionality  can  be  lessened 
by  employing  higher  order  derivatives.  In  particular,  we  pose  the  sampling  task 
(i.e.,  the  task  of  selecting  training  points)  for  the  Gaussian  process  approximation 
as  a  sequence  of  optimization  problems  (solved  by  Newton  methods)  that  seek  to 
maximize  the  error  between  the  Gaussian  process  approximation  and  the  underly¬ 
ing  true  pdf.  These  points  also  tend  to  be  points  of  high  probability  density  of  the 
underlying  pdf.  Moreover,  a  “piecewise”  Gaussian  approximation  to  the  underly¬ 
ing  pdf  is  adaptively  constructed  with  local  covariance  matrices  that  are  inverses  of 
Hessians  of  the  negative  log  posterior  evaluated  at  the  interpolation  points.  (As  is 
well  known,  when  the  parameter-to-observable  map  f(m)  is  linearized,  the  posterior 
covariance  matrix  is  equivalent  to  the  inverse  of  this  Hessian.)  This  proposed  Hessian- 
based  Gaussian  process  method  for  Bayesian  interpolation  exploits  previous  work  on 
adaptive  choice  of  interpolation  points  in  reduced  model  construction  using  a  greedy 
algorithm  [11,14,27]. 

The  remainder  of  the  paper  is  organized  as  follows.  Section  2  reviews  important 
characteristics  of  conventional  Gaussian  process  response  surfaces,  among  which  are 
the  equivalence  with  radial  basis  function  approximation,  and  with  Bayesian  interpo¬ 
lation.  This  motivates  us  to  develop  a  non-stationary  Hessian-based  Gaussian  process 
in  Section  3,  followed  by  a  heuristic  adaptive  sampling  strategy  for  computer  exper¬ 
imental  design  in  Section  4.  Next,  Section  5  provides  an  analysis  of  our  proposed 
approach.  Section  6  details  the  choice  of  the  error  function,  numerical  optimization 
methods,  initialization,  and  how  to  update  the  training  set.  Verification  of  the  pro¬ 
posed  response  surface  methodology  is  carried  out  in  Section  7  for  several  probability 
density  approximation  problems,  including  the  problem  of  Bayesian  inference  of  the 
shape  of  a  scatterer  from  noisy  observations  of  scattered  electromagnetic  waves.  Fi¬ 
nally,  Section  8  concludes  the  paper  and  discusses  some  ongoing  research  issues. 

2.  Standard  Gaussian  process  response  surfaces.  We  start  by  reviewing 
the  standard  Gaussian  process  response  surface  methodology.  In  order  to  avoid  un¬ 
necessary  confusions  with  the  Bayesian  inversion  described  above,  we  rename  the 
posterior  density  solution  of  the  inverse  problem  7rp0st(m)  as  d  (m)  for  which  we  seek 
to  find  an  approximation.  Assume  we  are  given  a  training  set  {m i,di  =  d(rrq)}™=1 
where  G  RN  is  a  point  (training  site)  in  the  TV-dimensional  parameter  space  and 
its  corresponding  function  evaluation  is  di.  If  the  training  set  is  noise-free,  which  is 
the  case  in  this  paper  since  we  simply  evaluate  d  (nij),  the  Gaussian  process  response 
surface  method  is  a  Bayesian  interpolation  technique  that  aims  to  statistically  inter¬ 
polate  the  unknown  underlying  function  d  (m)  given  the  training  set  [56].  The  main 
idea  behind  Gaussian  process  response  surface  methods  is  to  assume  the  unknown 
deterministic  function  d( m)  to  be  a  random  function  realization  generated  from  a 
Gaussian  process  prior.  The  Gaussian  process  prior  should  be  therefore  sufficiently 
flexible,  which  is  assumed  from  now  on,  so  that  ideally  there  exists  a  random  function 
realization  that  is  indistinguishable  from  the  unknown  function.  Once  the  observ¬ 
able  data  are  obtained,  they  are  combined  with  the  Gaussian  process  prior  through  a 
Bayesian  framework  to  produce  predictions  for  the  unknown  function  d( m). 

By  definition,  a  random  function  d( m)  is  a  Gaussian  process  if  the  marginal 
density  tt  (d( mi),  d(rri2), . . . ,  d(mn))  is  a  multivariate  Gaussian,  for  any  set  of  points 
{irq}™=1.  A  Gaussian  process  is  completely  determined  by  a  mean  function  fi( m)  and 
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a  covariance  function  k  (mt,  rrij)  for  two  arbitrary  points  m*  and  m j.  Assume  that 
these  functions  are  given  for  now  (their  constructions  are  the  subjects  of  Sections  3.1 
and  3.2).  By  assigning  a  Gaussian  process  prior  on  the  random  function  d( m),  the 
joint  distribution  of  d{ m*)  with  n  observations  d0^s  =  [d( mi),  d(rri2), . . . ,  d(mn)]  at 
n  training  points  M  =  [mi, . . . ,  mn]  is  a  Gaussian  given  by 


7r(d(m*),dobs|M,m*)  =  J\f 


( 

P'obs 

K(M,M) 

K( M,m*)  ' 

l 

_  Mm*)  _ 

1 

Kt{  M,m*) 

fc(m*,  m*) 

) 

(2.1) 


where  the  matrix  if(M,  M)  is  computed  as  Kij  =  fc(m^mj),  and  &(rrq,m*)  the  ith 
element  of  the  column  vector  if(M,  m*).  The  availability  of  the  training  points  is 
assumed  for  now,  and  in  Section  4  we  will  present  an  adaptive  sampling  method  to 
select  these  points.  Using  conditional  distribution  of  multivariate  normal  [56,59],  the 
posterior  distribution  of  d( m*)  is  given  by 


TTpost  (rf(m* )  | M,  d0f,s ,  m* )  =  N  (e  {d(m*)}post ,  var  {d(m*)}post)  ,  (2.2) 

where  the  expectation  and  variance  read 

E{d( m*)}post=  +KT(M,  m*)  [K(M,  M)]”1  (do6s  - 

M  ( m*  )prior 

var{d( m*)}post  =  fc(m*,m*)  ~KT(M,  m*)  [AT(M,  M)]-1  K(M,  m*) 

V'1  V  ^ 

var{d(  m*)}prior 

Since  m*  is  arbitrary,  equation  (2.2)  is  the  posterior  distribution  of  d  (m)  at  any  m 
in  the  parameter  space.  The  Bayesian  interpretation  is  now  clear  as  follows.  Equation 
(2.3)  states  that  the  posterior  mean  function  at  m*  is  the  corrected  version  of  the  prior 
mean  function  using  the  observation  (or  measurement)  information  encoded  in  the 
second  term  on  the  right  side.  Furthermore,  the  posterior  error  bar  (or  the  posterior 
uncertainty)  is  reduced  once  the  prior  knowledge  and  observations  are  combined  as 
shown  in  equation  (2.4).  Indeed,  since  the  covariance  matrix  iG(M,M)  is  symmetric 
positive  definite,  and  hence  its  inverse,  the  second  term  on  the  right  side  is  positive, 
which  shows  that  var  {d( m*)}post  <  var  {d( m*)}prior. 

We  now  discuss  some  other  properties  of  Gaussian  process  response  surfaces  that 
are  useful  for  our  subsequent  developments.  To  begin,  we  introduce  the  mean  squared 
prediction  error  (MSPE)  with  respect  to  a  distribution.  Following  Santner  et  al  [59] 
we  define 


(2.3) 

(2.4) 


MSPE  (d(  m*), 


E 


d(  m*)  —  d{  m*) 


(2.5) 


where  d( m*)  is  a  generic  predictor  of  d( m*),  Fn  denotes  the  joint  distribution  of 
(d0bs,  d(m*)),  be.,  the  distribution  in  (2.1).  The  following  theorem  summarizes  some 
important  properties  of  Gaussian  process  response  surfaces. 

Theorem  2.1.  The  following  properties  hold  for  the  Gaussian  process  defined  in 

(2-1) 

i)  The  predicted  mean  function  (2.3)  is  the  unique  minimizer  of  the  MSPE  with 
respect  to  joint  distribution  (2.1).  Furthermore ,  it  is  a  linear  and  unbiased 
predictor.  That  is,  it  is  the  best  linear  unbiased  predictor  (BLUP). 
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ii)  The  predicted  mean  function  (2.3)  interpolates  the  unknown  functions  at  all 
points  in  the  training  set  M. 

Hi)  The  mean  square  prediction  error  incurred  by  the  predicted  mean  function  is 
exactly  the  variance  (2.4). 

Proof  See  Santner  et  al.  [59]  for  a  proof.  □ 

The  first  assertion  of  Theorem  2.1  therefore  suggests  that  one  should  use  the 
predicted  mean  function  (2.3)  as  the  predictor  for  the  unknown  random  function 
d( m).  The  second  assertion  implies  that  the  variance  (2.4)  is  zero  at  all  the  training 
points,  that  is,  the  predicted  uncertainty  at  the  training  points  is  zero.  Hence  all  the 
random  functions  generated  by  (2.2)  interpolate  the  observed  data.  Moreover,  the 
third  assertion  implies  that  the  posterior  variance  (2.4)  can  be  used  as  a  measure  of 
uncertainty  for  the  Gaussian  process  predictor  (2.3). 

We  next  relate  the  mean  predictor  (2.3)  with  radial  basis  interpolations.  If  the 
covariance  function  /c(-,  •)  is  of  radial-basis-function  type,  the  predicted  mean  function 
can  be  shown  to  be  a  radial  basis  interpolation  as  follows  [56] .  Let  us  denote 

a=[X(M)M)]-1(do6s-Mo6J. 

Substitute  a  into  (2.3)  we  obtain 


n 

<li(m*)  =  /Lt(m*)  +  m*),  (2.6) 

i= 1 

which  is  a  radial  basis  interpolation  of  the  error  between  the  predictor  and  the  prior 
mean.  As  a  result,  the  predictor  inherits  the  regularity  of  the  covariance  function, 
assuming  the  prior  mean  is  sufficiently  regular.  The  importance  of  the  mean  function 
and  the  covariance  function  is  now  clear.  They  reflect  our  prior  knowledge  about  what 
the  unknown  function  d( m)  is  likely  to  be.  For  example,  if  the  underlying  function 
is  expected  to  be  not  very  nonlinear  and  infinitely  smooth,  the  mean  function  can 
be  chosen  to  be  linear,  and  a  squared  exponential  function  (also  known  as  Gaussian 
kernel) 

Mm>i  mi)  =  exp  (Mljm;  -  mj|||_1^  ,  (2.7) 

can  be  used  as  the  covariance  function.  Here,  we  have  defined  the  Mahalanobis  norm 
as 


with  a  positive  definite  matrix  T,.  Typically,  is  chosen  to  be  a  constant  diagonal 
matrix  whose  diagonal  entries  are  inferred  from  the  training  data  [42,48].  Each  diag¬ 
onal  entry  corresponding  to  each  dimension  can  be  considered  as  the  length  scale  over 
which  the  predictor  changes  significantly  in  that  particular  dimension.  Statistically, 
these  length  scales  determine  distance  between  two  points  in  each  dimension  such 
that  the  predictions  at  these  points  are  uncorrelated.  In  other  words,  these  length 
scales  reflect  our  beliefs  about  the  smoothness  of  the  unknown  function  d( m).  For 
example,  if  the  ith  diagonal  entry,  is  small,  the  predictor  varies  rapidly  in  the  ith 
dimension  while  it  tends  to  be  constant  for  large  X) a. 

3.  Non-st  at  ionary  adaptive  Hessian-based  Gaussian  process  response 
surfaces. 
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3.1.  Prior  mean  construction.  We  next  discuss  on  how  to  choose  the  prior 
mean  function  and  postpone  the  construction  of  the  covariance  function  to  Section 
3.2.  For  our  problem  of  interest  where  the  unknown  underlying  function  d( m)  is  a 
probability  density,  its  mean  value  can  be  estimated  as 


d 


mean 


Jn  rf(m)  & }  <  1 

m  (Cl)  ~  m  (Cl)  ’ 


where  m  (Cl)  denotes  the  measure  of  £2,  and  the  inequality  is  obtained  from  the  fact 
that  the  domain  of  interest  Cl  is  a  subset  high  dimensional  parameter  spaces  over  which 
d( m)  is  a  bona  fide  density,  i.e.,  fRN  d( m)  dCl  =  1.  Clearly,  the  mean  value  dmean 
is  small  if  the  measure  of  the  domain  of  interest  is  sufficiently  large.  We  therefore 
expect  that  zero-mean  is  a  good  prior  information.  This  is  intuitively  meaningful 
since  d( m),  the  Bayesian  posterior  probability  density,  is  typically  significant  only  in 
the  neighborhood  of  the  modes,  while  it  tends  to  be  small  or  close  to  zero  elsewhere. 
On  the  other  hand,  since  the  approximation  approaches  the  prior  mean  for  points 
that  are  further  away  from  the  training  set,  the  zero- mean  prior  permits  reasonable 
approximations  for  regions  with  small  probability  density. 

3.2.  Adaptive  non-stationary  covariance  function.  Covariance  functions 
that  are  a  function  of  only  relative  distance  between  two  points,  e.g.,  equation  (2.7) 
with  constant  5],  are  known  as  stationary  covariance  functions.  However,  Gaussian 
processes  with  stationary  covariance  function  can  provide  accurate  predictors  only 
for  functions  with  nearly  constant  smoothness  since  stationary  lacks  the  ability  to 
adapt  to  variable  smoothness  of  the  unknown  function  of  interest.  A  number  of  non- 
stationary  covariance  functions  have  been  devised  in  literature,  see  [23,26,32,37,50,51, 
54-56,58,60]  for  examples.  Below,  we  rationalize  the  derivation  of  our  Hessian-based 
non-stationary  covariance  functions. 

We  begin  by  re-examining  the  predictor  (2.6)  with  zero-mean  Gaussian  process 
prior  as  argued  in  Section  3.1: 


n 

dn( m*)  ss  m*).  (3.1) 

i= 1 

If  n  =  1  and  if  the  covariance  is  of  Gaussian- type  as  in  equation  (2.7),  then  the  pre¬ 
dictor  in  (3.1)  is  nothing  more  than  a  Gaussian  approximation  to  d( m)  where  the 
covariance  matrix  is  given  by  5L  If,  in  addition,  X)-1  is  the  hessian  of  —  lnd(m),  i.e., 
X)-1  =  V2  (— lnd(m)),  then  the  predictor  becomes  the  popular  Laplace  approxima¬ 
tion  (see  [43]  and  references  therein).  That  is,  the  predictor  is  exact  if  the  underlying 
density  d( m)  is  a  Gaussian  whose  peak  is  at  mi. 

For  n  >  1,  it  is  natural  and  intuitive  to  generalize  the  Laplace  approximation  idea 
by  combining  local  Laplace  approximations  constructed  in  different  sub-domains.  The 
challenge  is  how  to  combine  them  to  form  a  global  approximation.  Our  idea  is  the 
following.  Since  the  Laplace  approximation  is  locally  an  accurate  approximation,  the 
contribution  of  &(rrq,  m*)  to  the  predictor  should  dominate  the  other  terms  k(m.j ,  m*) 
for  j  i,  if  m*  is  closest  to  m^.  In  order  to  fulfill  this  goal,  we  introduce  the  following 
non-stationary  covariance  function 

=  J2P(z  =  l\m)P(z  =  l\ m^exp 


(3.2) 
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where  Hi  =  V2  (— lnd(m/))  and  L  <  n  (to  be  shown  in  Section  4).  P(z  =  Z|rrq) 
can  be  considered  as  the  conditional  probability  of  having  selected  the  Zth  kernel 
exp  (— |||m^  —  mjjjljJ  given  m^,  and  z  is  known  as  latent  indicator  variable.  For 
example,  P{z  =  l\m.i)  should  approach  1  if  nij  — >>  m^,  and  zero  if  nij  is  far  away 
from  irq.  In  particular,  as  derived  in  Section  3.3,  the  following  form  of  P{z  =  Z|m^) 
satisfies  the  requirements 


P(z  =  Z  |  rrii ) 


(3.3) 


which  can  be  seen  as  a  special  form  of  a  well-known  class  of  soft-max  gating  networks 
in  machine  learning  community  [9].  Note  that  the  denominator  is  just  a  normalized 
constant  while  the  numerator  is  a  Gaussian  with  mean  rrq  and  inverse  covariance 
matrix  Hj. 

The  importance  of  the  Hessian  information  of  the  underlying  posterior  d( m)  is 
now  explained.  It  can  be  seen  that  the  Hessians  appear  two  times  in  the  covariance 
function  definition  (3.2).  First,  the  appearance  in  the  kernels  exp  (—  |||rrq  —  m^H^) 
ensures  that  the  predictor  is  the  desired  piecewise  Laplace  approximation.  Second, 
the  role  of  the  Hessians  in  the  membership  probability  P{z  =  Z|rn^)  is  to  guide  the  co- 
variance  function  to  pick  the  appropriate  dominant  kernel  in  the  predictor.  Note  that 
the  product  P{z  =  l\nii)P{z  =  l | nij )  is  necessary  because  it  not  only  guarantees  the 
symmetry  of  the  covariance  function  but  also  determines  the  appropriate  covariance 
structure  and  hence  the  smoothness  of  the  predictor.  By  definition,  /c(rrq,  m^)  is  the 
covariance  between  d(rrq)  and  d(mj).  The  covariance  in  turn  encodes  the  smooth¬ 
ness  of  the  predictor.  Moreover,  the  smoothness  of  a  surface  is  typically  measured 
by  its  second  derivatives,  i.e.,  the  Hessian.  These  suggest  that  the  Hessians  should 
be  used  in  the  covariance  to  shape  the  smoothness  of  the  predictor  accordingly.  Our 
covariance  function  in  (3.2)  is  built  based  on  this  intuition  (and  on  the  desire  to  have 
a  piecewise  Laplace  approximation).  That  is,  if  nij  and  nij  are  close  to  mg  with 
respect  to  the  norm  ||  •  ||/jz,  the  Zth  term  of  the  sum  on  the  right  hand  side  of  (3.2) 
dominates  the  covariance  while  its  contribution  to  the  covariance  is  small  if  either  rrq 
or  nij  is  far  away  from  m /. 

Under  the  piecewise  Laplace  approximation  view  point,  the  membership  prob¬ 
ability  is  served  as  an  automatic  mechanism  to  partition  the  high  dimensional  pa¬ 
rameter  space  into  overlapping  sub-regions  over  which  the  posterior  density  of  the 
inverse  problem  is  dominantly  interpolated  and  approximated  by  local  Gaussians.  As 
a  demonstration,  Figure  3.1  shows  three  membership  probabilities  corresponding  to 
three  modes  of  the  exact  density  d( m)  =  (—4, 1)  +  (0,  0.75)  +  (4,  0.5).  As 

can  be  seen,  each  membership  probability  is  one  if  rrq  is  close  to  its  corresponding 
mode  and  zero  otherwise.  Sub-regions  dominated  by  the  first  and  the  second  mem¬ 
bership  probabilities  do  not  overlap  while  they  do  with  that  of  the  third  one.  This 
reflects  the  fact  the  first  and  second  modes  do  not  overlap,  but  they  do  with  the 
third  one.  The  role  of  the  membership  probabilities  in  automatically  identifying  the 
dominant  local  Laplace  approximation  is  clearly  demonstrated  in  this  figure. 

A  question  that  needs  to  be  addressed  is  whether  (3.2)  defines  a  valid  covariance 
function.  This  is  important  since  a  Gaussian  process  exists  if  and  only  if  the  covariance 
function  is  valid,  according  to  the  Kolmogorov’s  existence  theorem  [2].  The  following 
result  answers  this  question. 

Theorem  3.1.  Assume  L  >  1  and  Hi  is  symmetric  positive  definite  V/  = 
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Fig.  3.1.  Three  membership  probabilities  corresponding  to  three  modes  of  the  exact  density 
d(m)  =  (-4,1)  +  |7V  (0,0.75)  +  (4, 0.5). 


1  Then 

L  (  1 
fc(mj,mj)  =  ^  P(z  =  l\m.i)P(z  =  Z|mj)exP  (  --|K  -  m 


i=i 


lill 


is  a  valid  non- stationary  covariance  function. 

Proof  The  non-stationary  is  clear  since  ^(m^mj)  a  function  of  not  only  the 
relative  distance  between  nij  and  m j  but  also  and  nij.  The  symmetry  is  also 
clear.  In  order  to  prove  that  the  kernel  k( m*,  m^)  is  positive  definite,  it  is  sufficient  to 
show  that  the  matrix  K  (M,  M)  is  positive  definite  Vn  >  1.  We  begin  by  the  following 
fact 


exp  (  — - ||mi  -  m 


(7r/2)^2|ffrT/2 


L 


exp  (— ||mj  -  mlllfj  exp  (-||m ,  -  m||^( 


Thus,  Vn  >  l,Vc  £  Mn,  we  have 


ctK  (M,  M)  c  =  ^  ^  CiCjk(m.i ,  m j)  =  ^ 


r  g/ 2)^/2iffrT/2 


mr 


|y^c,;P(z  =  IK)  exp  (—IK  -  Hlk)  j 


dfl  >  o. 


It  is  clear  that  the  equality  happens  if  and  only  if  the  term  in  the  curly  bracket  is  zero 
almost  everywhere  which  in  turn  happens  if  and  only  if  q  =  0,  Vi  =  1, . . . ,  n,  and  this 
completes  the  proof.  □ 
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The  importance  of  Theorem  3.1  is  now  clear.  It  ensures  the  positive  definiteness 
of  K  (M,  M),  which  in  turn  guarantees  its  invertibility  that  is  necessary  for  (2.3)  and 
(2.4). 

3.3.  Derivation  of  the  membership  probabilities.  Recall  that  the  member¬ 
ship  probability  P(z  =  /|m^)  is  the  conditional  probability  of  having  selected  the  Zth 
kernel  exp  (—  |||m^  —  m j\\2Hl)  given  m*.  Using  the  Bayes’  theorem  gives 

P(*  =  i|m,)  =  0xP(-—0  _ 

£p=i  =  p'!  ■  p(z  =  p) 

where  P(rrq|z  =  l)  and  P(z  =  l)  are  the  likelihood  and  prior  of  selecting  the  Zth 
kernel,  respectively.  Since  our  prior  knowledge  is  vague,  we  choose  P(z  =  l )  = 
1/L,VZ  =  1  ,...,L.  The  likelihood  is  chosen  to  be  the  (unnormalized)  Gaussian 
exp  (—  |||rrq  —  miJjlj-J  which  reflects  our  desire  that  the  likelihood  must  be  large, 
in  the  Mahalanobis  norm  ||  •  \\h^  as  m^  approaches  the  mean  rip.  The  extent  to 
which  the  likelihood  is  still  significant  is  determined  by  the  curvature  of  the  true  un¬ 
known  d(m),  which  appears  as  the  inverse  of  the  likelihood  covariance  matrix.  Hence, 
the  final  form  of  the  membership  probability  is  given  as  in  equation  (3.3). 

4.  Sequential  adaptive  sampling  strategy.  This  section  addresses  the  com¬ 
puter  experimental  design  issue  on  how  to  look  for  the  training  points  adaptively.  The 
method  we  are  going  to  describe  follows  our  previous  work  on  scalable  adaptive  algo¬ 
rithms  for  constructing  reduced  models  in  high-dimensional  parameter  spaces  [11,14]. 
To  begin,  we  define  the  following  generic  error  function 


£(m  ,n), 


(4.1) 


which  is  a  function  of  parameters  m  and  the  training  set  size  n  (and  the  training 
set  itself,  but  is  omitted  here  for  simplicity).  The  error  could  be,  for  example,  the 
squared  error  between  the  true  function  d(m)  and  the  predictor  cZn(m) 


Q(m,n) 


2 


dn(m)-d(m)  , 


(4.2) 


or  the  predictive  variance  in  (2.4)  as  the  error  indicator.  Generally,  for  Algorithm  1 
and  its  corresponding  theory  to  work,  it  is  desirable  that  the  cost  Q( m,  n)  be  less  then 
some  small  tolerance  6  at  all  training  points.  The  squared  error  and  the  predictive 
variance  clearly  satisfy  this  requirement  since  they  are  zero  at  all  the  training  points. 
We  first  outline  the  adaptive  sampling  algorithm  as  follows. 

Algorithm  1. 

Adaptive  Sampling  Algorithm 

1.  Given  a  set  of  training  points  {rrq,cZ*  =  d(m.i)}™=1,  solve  the  optimization 
problem 


maxg(m,  n),  (4.3) 

men 

to  find  the  location  in  parameter  space  at  which  the  error  is  maximized ,  i.e. 
find  m*  =  arg  max  Q  (m,  n) . 

2.  If  G(m*,n)  <  e,  where  £  is  the  desired  level  of  accuracy,  then  terminate  the 
algorithm.  If  not,  go  to  the  next  step. 

3.  With  m  =  m*;  compute  the  true  function  d( m*).  Update  the  predictor.  Go 
to  Step  1. 
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The  first  step  of  the  adaptive  sampling  algorithm  incrementally  find  the  next 
training  points  at  locations  where  Q(m ,  n),  as  a  measure  of  the  error  between  the 
underlying  true  posterior  density  and  the  current  Gaussian  process  response  surface 
approximation,  is  maximized.  The  structure  of  the  optimization  problem  that  must  be 
solved  in  each  adaptive  cycle  is  similar  to  that  of  PDE-constrained  inverse  problems, 
and  hence,  many  of  the  associated  tools  for  large-scale  optimization  can  be  recruited, 
in  particular  Newton-CG  solvers,  trust-region  globalization,  and  Eisenstat- Walker 
inexactness,  e.g.,  [3,4,6,7,11,21].  In  order  to  make  Algorithm  1  well-defined,  as 
discussed  in  the  next  section,  the  initial  guess  is  admissible  if  the  error  function  is  at 
least  6. 

If  the  maximum  error  is  less  than  the  prescribed  tolerance  e  in  Step  2,  the  algo¬ 
rithm  stops.  Otherwise,  Step  3  will  update  the  current  predictor  and  return  to  Step 
1.  In  particular,  the  training  set  is  updated  using  the  maximizer  m*  and  its  cor¬ 
responding  true  posterior  value  d( m*).  If  the  Hessian  of  the  negative  log  posterior, 
V2  (— lnd(m*)),  is  positive  definite,  the  adaptive  covariance  function  (3.2)  and  the 
membership  probability  (3.3)  are  updated  by  increasing  L  by  one.  That  is,  we  build 
a  local  Gaussian  approximation  whose  inverse  covariance  is  the  local  Hessian  of  the 
negative  log  posterior.  This  local  Gaussian  approximation  is  then  used  to  update  the 
Gaussian  process  covariance  function  and  the  membership  probability.  Effectively,  the 
method  builds  an  adaptive  sparse  non-stationary  Gaussian  process  that  is  generally 
improved  after  each  cycle.  The  method  identifies  regions  in  high  dimensional  spaces 
where  the  discrepancy  of  the  response  surface  is  maximal,  and  then  inserts  local  Gaus¬ 
sian  approximations  at  those  points  to  drive  the  response  surface  error  down.  Since 
the  Gaussian  process  response  surface  is  interpolating,  the  resulting  approximation  is 
identical  to  the  posterior  density  at  these  points.  Furthermore,  the  Hessian  ensures 
that  the  response  surface  locally  adapts  to  the  shape  of  posterior  d  (m)  accurately. 

5.  An  analysis  of  the  adaptive  Gaussian  process  response  surface.  In 

this  section,  each  adaptive  cycle  in  Algorithm  1  is  analyzed  to  show  that  the  whole 
algorithm  is  well  defined.  We  first  show  that  the  optimization  problem  (4.3)  has  a 
solution  under  suitable  assumptions. 

Proposition  5.1.  Assume  G(m,  n)  is  a  continuous  function  of  m  G  11,  where 
is  closed  and  bounded  subset  ofRN.  Then  there  exists  a  solution  for  the  optimization 
problem  (4.3). 

Proof  See  [11,14]  for  a  proof.  □ 

The  closedness  and  boundedness  of  the  domain  of  interest  Q  are  reasonable.  For 
example,  in  the  shape  inverse  electromagnetic  scattering  problem  studied  in  Section 
7,  the  shape  parameters  m  are  bounded  due  to  our  prior  belief  in  the  boundedness  of 
the  shape.  Proposition  5.1  therefore  implies  that  Steps  2  and  3  of  Algorithm  1  always 
go  through,  thus  each  cycle  certainly  finishes. 

Revisiting  previous  sampled  points  is  an  expensive  task  requiring  forward  and 
adjoint  solves,  and  hence  should  be  avoided.  On  the  other  hand,  distinction  of  sampled 
points  implies  the  non-singularity  of  K(M,  M),  which  is  vital  in  ensuring  the  existence 
and  uniqueness  of  the  the  predictor  and  its  uncertainty  in  (2.3)-(2.4).  Our  next  result 
shows  that  in  fact  Algorithm  1  always  finds  new  sampling  points. 

Theorem  5.2.  Algorithm  1  is  well-defined  in  the  sense  that  (i)  it  terminates  in 
finite  time  and  (ii)  all  sampled  points  are  distinct. 

Proof  See  [11,14]  for  a  proof.  □ 

The  next  question  needs  to  be  resolved  is  whether  the  proposed  adaptive  training 
(also  known  as  active  learning)  can  systematically  bias  the  inference.  Since  Bayesian 
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inference  is  consistent  with  the  likelihood  principle  [41]  which  states  that  the  inference 
should  depend  only  on  the  likelihood  of  the  data  that  is  actually  observed,  one  is  free 
to  choose  training  points  without  introducing  any  bias  to  the  inference. 

We  next  show  that  our  piecewise  Laplace  (Gaussian)  approximation  (2.3)  im¬ 
proves  as  the  number  of  training  points  increases. 

Theorem  5.3.  Denote  dn( m*)  as  the  mean  predictor  (2.3)  and  assume  ^n+l(m*)  7^ 
dn(m*),  Vn  G  N.  As  the  number  of  training  points  increases,  the  mean  squared  pre¬ 
diction  error  decreases  in  the  following  sense: 


r 

r  a  1 

21  f 

r  -  1 

E  < 

dn+ i(m*)  -  d( m*) 

<E 

>  Fn+1  f 

dn( m*)  -  d( m*) 

Fn 


Proof  We  have  the  following  inequalities  for  the  MSPE 


E 


2 


dn+i(m*)  -  d( m*) 


< 


where  d( m*)  denotes  an  arbitrary  linear  predictor.  The  first  inequality  holds  true  due 
to  the  minimization  property  of  dn+ i(m*)  in  the  first  part  of  Theorem  2.1.  The  second 
inequality  follows  from  choosing  d(m*)  =  dn( m*)  and  applying  the  marginalization 
property  of  the  multivariate  Gaussian  (2.1).  Note  that  the  second  inequality  is  strict 
due  to  the  assumption  dn+i(m*)  ^  dn( m*)  and  the  uniqueness  of  the  minimizer 
dn+ i(m*). 

□ 


6.  Error  function,  numerical  optimization,  initialization,  and  training 
set  update.  The  active  learning  method  proposed  in  Section  5  works  for  a  class  of 

r.  "I  2 

quite  general  error  functions.  For  our  purpose,  the  true  error  dn( m)  —  d( m)  turns 
out  to  be  a  good  candidate,  as  we  now  explain.  Recall  that  the  main  goal  of  this  paper 
is  to  find  as  many  modes  as  possible  and  then  to  interpolate  the  expensive-to-evaluate 
posterior  density  function  d( m)  using  a  piecewise  Laplace  approximation.  Intuitively, 
the  interpolation  is  statistically  more  accurate  if  it  captures  most  of  significant  proba¬ 
bility  regions  of  d( m).  In  order  to  fulfill  this  goal  heuristically,  we  design  Algorithm  1 
to  place  training  points  where  the  discrepancy  between  the  predictor  and  true  function 
is  largest  (at  least  locally).  Due  to  the  local  Gaussian  nature  of  the  predictor,  if  m 
is  sufficiently  far  away  from  rrq,  V/  =  1, . . . ,  L,  the  predictor  dn( m)  will  approach  the 

-i  2 


prior  mean,  which  is  zero,  and  hence  the  true  error 


dn( m)  -  d( m) 


will  approach 


[d( m)]2.  As  a  result,  the  worst-case  scenario  error  found  in  each  adaptive  cycle  is 
most  likely  a  mode  of  d{ m). 

As  also  discussed  in  Section  5,  D  is  typically  generated  by  simple  bound  con¬ 
straints  on  parameters  m.  Similar  to  our  previous  work  [11,14],  we  choose  to  solve 
the  bound-constrained  optimization: 


maxt/(m),  (6.1) 

m 

subject  to 


m-rnin  —  m  T  Ulmax  5 


(6.2) 
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using  a  subspace  trust  region  interior  reflective  inexact  Newton  conjugate  gradient 
method  described  in  [11]. 

Initialization  is  one  of  the  important  factors  determining  the  cost  that  the  opti¬ 
mization  solver  takes  to  converge.  In  particular,  if  the  initial  guess  is  far  away  from  the 
optimizer,  it  might  take  several  iterations  for  the  optimization  solver  to  move  to  the 
basin  of  attraction  where  the  designed  convergence  rate  takes  place,  e.g.,  quadratic,  if 
a  Newton  method  is  employed.  Therefore  in  order  to  reduce  the  cost,  it  is  vital  to  find 
a  good  initial  guess  for  the  optimization  problem  in  each  greedy  cycle.  Clearly,  the 
simplest  way  is  to  take  a  random  initialization  which  is  most  likely  not  to  be  close  to 
the  basin  of  attraction  of  any  local  optimizers.  Since  the  MSPE  (2.4)  is  analytical  and 
cheap  to  evaluate,  another  straightforward  idea  is  to  find  the  point  where  the  MSPE 
is  maximized  and  then  use  it  as  the  initial  guess.  However,  as  noticed  by  MacKay  [41], 
maximizers  of  MSPE  tend  to  be  at  the  boundary  of  the  domain  fl,  which  is  not  of 
our  interest. 

In  the  context  of  active  learning  (adaptive  sampling)  for  Gaussian  process,  Seo  et 
al.  [63]  numerically  shows  that  using  the  selection  criteria  by  Cohn  [18]  yields  a  more 
accurate  predictor  than  that  proposed  by  MacKay  [41].  The  reason  is  that  the  Cohn 
criteria  for  sampling  aims  to  minimize  the  mean  squared  error,  and  in  particular, 
it  maximizes  the  average  reduction  in  predictive  variance.  Nevertheless,  adaptive 
sampling  using  Cohn  criteria  is  expensive  as  discussed  in  Christen  and  Sanso  [17]. 
They  propose  a  cheap  alternative,  an  approximation  to  the  Cohn  criteria,  over  a 
test  set  Ma  =  [mf,...,m“0]  of  (random  or  grid)  points  in  the  parameter  space. 
Specifically,  the  selection  process  is  based  on  the  solution  of  the  following  optimization 
problem: 


max  J( mh,  (6.3) 

m“£Ma  V  lJ  V  7 

where 

7(ma,  =  + 

with  nij  G  M  and  m a-  G  Ma;  and  I ^  are  two  constants  independent  of  and 
are  defined  as 

na  n  n 

=  2,  Im  =  max  y>(m,,mi)|. 

mi  E1V1 

i=l  3=1  j  =  l 

Due  to  the  numerator,  the  maximizer  of  the  cost  in  (6.3)  should  have  high  predictive 
variance  and  be  highly  correlated  with  other  points  in  Ma.  At  the  same  time,  it 
should  not  be  so  close,  and  hence  less  correlated,  to  the  current  training  set  M,  due 
to  the  denominator.  Meanwhile,  an  approach  for  near-optimal  training  points  has 
been  proposed  [39].  However,  one  has  to  solve  a  combinatorial  optimization,  which 
we  try  to  avoid  here  since  it  is  prone  to  the  curse  of  dimensionality.  After  all,  we  need 
a  cheap  and  reasonably  good  initial  guess  and  then  devote  our  effort  to  the  continuous 
optimization  problem  (4.3)  using  efficient  and  scalable  optimization  solver,  which  is 
designed  to  be  immune  to  the  curse  of  dimensionality.  Keeping  this  goal  in  mind, 
we  choose  the  solution  of  (6.3)  as  the  initial  guess  for  three  reasons  besides  its  fast 
evaluation.  First,  it  is  not  so  close  to  the  current  training  set  about  which  we  have 
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already  learned.  Second,  since  it  has  high  predictive  variance  it  could  belong  to  some 
region  in  parameter  space  where  the  error  may  be  large.  Third,  due  to  its  high 
correlation  to  other  points  in  Ma,  learning  this  point  could  provide  us  information 
about  other  unvisited  points  as  well. 

The  question  we  would  like  to  address  next  is  how  to  update  the  training  set  M. 
As  discussed  in  Section  4,  the  covariance  function  and  the  membership  probabilities 
may  not  be  updated  after  a  greedy  cycle,  depending  on  whether  the  Hessian  of  the 
negative  log  posterior  V2  (—  lnd(m*))  is  (semi-)  positive  definite  or  not.  In  contrast, 
the  training  set  M  is  always  updated  after  each  greedy  cycle.  One  could  simply  add 
the  maximizer  of  the  optimization  problem  (4.3)  where  we  locally  observe  the  largest 
error.  This  point  is  currently  and  intuitively  the  best  to  learn  about  the  unknown 
function  d( m).  However,  in  addition  to  the  maximizer,  the  numerical  optimization 
solver  also  supplies  a  whole  trajectory  of  points  starting  from  the  initial  guess  to  the 
optimizer  where  the  unknown  function  d( m)  is  evaluated.  These  points  therefore 
contain  information  about  d( m)  about  which  we  are  trying  to  learn.  This  suggests 
that  we  should  add  the  whole  trajectory  to  the  current  training  set.  The  trade¬ 
off  is  that  the  condition  number  of  the  covariance  matrix  K  (M,  M)  may  increase, 
and  methods  for  inverting  the  covariance  matrix  accurately  and  efficiently  have  been 
addressed  elsewhere  [25].  Here,  we  simply  use  the  Cholesky  decomposition. 

7.  Numerical  experiments.  In  this  section,  the  proposed  Hessian-based  GP 
predictor  is  compared  to  the  state-of-the-art  radial  basis  function  (RBF)  interpolation 
[10].  Since  the  standard  stationary  GP  predictor  can  be  shown  to  be  equivalent  to  a 
RBF  whose  kernel  function  is  the  same  as  the  GP  covariance  function  [56],  we  just 
need  to  compare  our  method  to  a  standard  GP  predictor.  To  be  fair,  we  also  adapt 
the  shape  parameter  of  the  radial  basis  function  using  the  maximum  likelihood.  To 
ensure  that  the  cost  of  generating  our  GP  predictor  and  the  cost  of  generating  the  RBF 
predictor  are  more  or  less  the  same,  the  number  of  function  evaluations  are  forced  to 
be  the  same  for  both  methods.  In  particular,  the  number  of  function  evaluations  for 
the  Hessian-based  GP  predictor  is  counted  as  the  following.  First,  each  forward  solve 
is  counted  as  one.  Second,  each  gradient  computation  which  requires  an  adjoint  solve 
is  counted  as  one  assuming  the  costs  of  solving  the  forward  and  the  adjoint  are  the 
same.  Third,  each  Hessian- vector  product  required  in  the  conjugate  gradient  iterations 
is  counted  as  two  since  a  forward-like  and  an  adjoint-like  equations  have  to  be  solved. 
Thus  to  be  fair,  if  the  Hessian-based  GP  uses  rip  functions  evaluations,  the  RBF 
will  have  rip  interpolation  points.  As  a  popular  choice,  the  Latin  Hypercube  (LHC) 
sampling  is  used  to  generate  interpolation  points  for  the  RBF  approach.  Finally,  to 
assess  the  quality  of  the  predictors,  we  use  several  popular  discrete  norms,  namely, 
the  mean  squared  error,  the  £1-norm,  the  ^2-norm,  and  the  Hellinger  norm  [16]. 

7.1.  One-dimensional  examples.  The  first  example  considered  in  this  subsec¬ 
tion  is  a  mixture  of  three  Gaussians  given  by 

d{ m)  =  (-4, 1)  +  iv (0, 0.75)  +  l-N (4, 0.5) . 

Figure  7.1  shows  the  GP  predictor  together  with  its  uncertainty  given  by  95%  cred¬ 
ibility  envelope.  The  quality  of  the  GP  predictor  (3  greedy  cycles)  and  the  RBF 
predictor,  both  with  143  function  counts,  is  shown  in  Table  7.1.  As  expected,  the 
RBF  predictor  is  more  accurate  than  the  GP  predictor  for  low  dimensional  problem 
and  simple  density  d( m).  Note  that  since  the  squared  error  (4.2)  is  used  as  the  cost 
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in  the  adaptive  sampling  algorithm,  the  mean  squared  error  tends  to  be  smaller  than 
other  measures. 


m 

Fig.  7.1.  The  GP  predictor  together  with  its  uncertainty  given  by  95%  credibility  envelope 
versus  the  exact.  Crossed  are  the  points  in  the  training  set  M  found  by  the  greedy  algorithm. 


Table  7.1 

GP  predictor  error  versus  RBF  predictor  error  over  a  LHC  grid  of  20,000  points. 


Method 

MSE 

G-norm 

£2-norm 

Hellinger  norm 

GP 

6.55e-06 

4.35 

3.62e-01 

1.74 

RBF 

3.77e-16 

1.23e-02 

2.75e-06 

5.31e-03 

For  more  complicated  ID  density  with  anisotropy  and  localized  features  such  as 
the  one  in  Figure  7.2,  where  the  true  density  is  given  by 


5 

d{  m)  =  y](25“763)A f 
1=0 


the  GP  predictor  (with  371  function  counts  after  7  greedy  cycles)  starts  to  be  com¬ 
petitive  and  this  can  be  seen  in  Table  7.2. 


Table  7.2 

GP  predictor  error  versus  RBF  predictor  error  over  a  LHC  grid  of  20,000  points. 


Method 

MSE 

G-norm 

^2-norm 

Hellinger  norm 

GP 

1.45e-06 

3.15 

1.70e-01 

1.22 

RBF 

2.55e-3 

2.31el 

7.14 

1.04el 

As  mentioned  in  Section  6,  the  ability  of  our  proposed  method  in  seeking  the 
modes  of  the  underlying  density  d( m)  can  be  observed  in  Figures  7.1  and  7.2.  Again, 
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Fig.  7.2.  The  GP  predictor  together  with  its  uncertainty  given  by  95%  credibility  envelope 
versus  the  exact.  Crossed  are  the  points  in  the  training  set  M  found  by  the  greedy  algorithm. 


this  is  important  for  our  purpose  that  high  probability  density  regions  should  be 
captured  as  much  as  possible. 

7.2.  Two-dimensional  example.  We  consider  the  following  mixture  of  four 
anisotropic  Gaussian 

d( m)  oc^exp  ^-i[m  -  m°]T H°[m  -  m"]j  , 


where  the  local  means  are  given  by 


mi  =  [— !-5,  -1.5]T,  mg  =  [1.5,  1.5]t,  mg 


[  2,  2]t;  mg  =  [5,  — 5]r, 


the  local  inverse  covariance  matrices  read 


H 


o 

1 


10  0 
0  1 


>  H 2 


1  0 
0  10 


3 

1 


2  -1.5 

-1.5  2 


and  the  coefficients  c*  are  randomly  generated  as 


ci  =  0.25,  c2  =  0.3,  c3  =  0.38,  c4  =  0.07. 


A  GP  predictor  with  553  function  counts  after  5  greedy  cycles  is  presented  in  Figure 
7.3(b),  whereas  the  true  function  is  shown  in  Figure  7.3(a).  An  adaptive  RBF  with 
553  LHC  sampling  points  is  shown  in  Figure  7.3(c).  As  can  be  seen,  the  GP  predictor 
outperforms  the  adaptive  RBF.  To  further  confirm  this,  we  compute  different  discrete 
error  norms  over  90,000  uniform  grid  points  and  present  the  results  in  Table  7.3.  The 
errors  of  the  GP  predictor  are  orders  of  magnitude  smaller  than  those  of  the  adaptive 
RBF. 
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Adaptive  GP 


-5  0  5 

m 


Adaptive  RBF 


-5  0  5 

m 


(b)  Adaptive  GP  predictor 


(c)  Adaptive  RBF  predictor 


Fig.  7.3.  A  mixture  of  four  anisotropic  Gaussians  in  2D  example;  7.3(a)  the  exact  function, 
7.3(b)  a  GP  predictor  with  553  function  counts  after  5  greedy  cycles,  and  7.3(c)  an  adaptive  RBF 
predictor  with  553  LHC  sampling  points. 


7.3.  Ten-dimensional  examples.  The  first  example  in  this  subsection  is  the 
mixture  of  two  Gaussians  in  10-dimensional  space: 

d( m)  =  ClN  (m?,  £°)  +  c2V  (mg,  Eg) , 


Table  7.3 

GP  predictor  error  versus  RBF  predictor  error  over  a  uniform  grid  of  90,000  points. 


Method 

MSE 

£1-norm 

£2-norm 

Hellinger  norm 

GP 

3.10e-6 

4.91 

5.28e-l 

2.19 

RBF 

1.36e-4 

1.52el 

3.5 

9.23 
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where 


ci  =  0.8,  c2  =  0.2, 

mi  =  —  2  x  ones(  10, 1),  ttt,2  =  2  x  O77,es(10, 1), 

EJ  =  dia#[0.1629, 0.1812, 0.0254, 0.1827, 0.1265, 0.0195, 0.0557, 
0.1094,0.1915,0.1930], 

£3  =  ^[0.0315, 0.1941, 0.1914, 0.0971, 0.1601, 0.0284, 0.0844, 
0.1831,0.1584,0.1919], 


where  ones (10, 1)  is  a  10  x  1  column  vector  with  all  elements  equal  to  1,  and  diag 
puts  a  vector  on  the  main  diagonal  the  zero  matrix  of  corresponding  size.  The  domain 
of  interest  is  the  hypercube  [ — 3, 3] 10 .  The  GP  predictor  requires  three  greedy  cycles 
with  679  function  evaluations  to  capture  the  two  modes.  In  order  to  compare  the 
exact  function  d( m)  and  its  adaptive  GP  predictor,  we  sample  both  of  them  using 
DRAM  [28];  an  efficient  MCMC  toolbox.  Table  7.4  shows  the  sample  means  from  one 
million  MCMC  simulations.  Here,  it  is  not  our  attempt  to  run  enough  simulations 
until  the  MCMC  converges,  but  to  show  how  well  the  response  surface  emulates  the 
exact  one.  As  can  be  seen,  the  sample  means  after  one  million  MCMC  simulations 
are  the  same  up  to  three  digits,  though  they  are  by  no  means  close  to  the  exact  mean 
—  1.2  x  ones{  10, 1). 


Table  7.4 

The  sample  means  of  GP  predictor  and  the  exact  function  up  to  one  million  MCMC  simulations. 


mean 

GP  predictor 

exact  d( m) 

-1.9941 

-1.9923 

m2 

-1.9894 

-1.9923 

m3 

-2.0007 

-1.9996 

m4 

-1.9880 

-1.9880 

ra5 

-1.9961 

-1.9957 

m6 

-1.9995 

-1.9997 

777,7 

-2.0005 

-2.0018 

m8 

-2.0003 

-2.0006 

m9 

-1.9911 

-1.9847 

777,10 

-1.9891 

-1.9902 

Now,  if  we  take  679  LHC  points  for  the  RBF  approach,  all  the  function  values 
evaluated  at  these  points  are  machine  zero,  and  hence  the  RBF  method  would  give  a 
zero  response  surface,  which  is  by  no  means  close  to  the  exact  function.  This  is,  as 
discussed,  a  manifestation  of  the  curse  of  dimensionality. 

Similarly,  we  consider  the  mixture  of  two  10-dimensional  Cauchy  distributions: 


d{ m)  =  aC  (m°,  cr°)  +  c2C  (mg,  c t°2 ) 
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where 


10 


ci  =  0.65,  C2  =  0.35, 

mi  =  zeros(  10, 1), m2  =  2  x  ones(  10, 1), 

<r°i  =  [0.4074,0.4529,0.0635,0.4567,0.3162,0.0488,0.1392, 
0.2734,0.4788,0.4824], 

erg  =  [0.0315,0.1941,0.1914,0.0971,0.1601,0.0284,0.0844, 
0.1831,0.1584,0.1919], 


where  zeros(  10, 1)  is  the  zero  vector  of  dimension  10  x  1.  Similar  to  the  Gaussian 
case,  the  domain  of  interest  is  the  hypercube  [ — 3,  3] 10 .  The  GP  predictor  requires 
three  greedy  cycles  with  1386  function  evaluations  to  capture  the  two  modes.  Table 
7.5  compares  the  sample  means  obtained  from  the  GP  predictor  and  the  exact  d( m) 
using  one  million  MCMC  simulations.  The  result  is  reasonable  though  it  is  not  as 
good  as  the  Gaussian  case.  Similar  to  the  Gaussian  case,  if  we  use  the  adaptive 
RBF  method  with  1386  LHC  points,  d( m)  is  machine  zero  at  these  points,  and  hence 
yielding  a  zero  response  surface!  In  fact,  for  both  Gaussian  and  Cauchy  cases,  we 
have  tested  that  d( m)  is  close  to  machine  zero  even  for  100,000  LHC  points.  Again, 
the  curse  of  dimensionality  is  in  action.  This  observation  also  suggests  that  all  the 
discrete  norms  that  we  have  used  above  for  low  dimension  examples  be  useless  in  high 
dimensional  problems  because  they  are  most  likely  zero. 


Table  7.5 


The  sample  means  of  GP  predictor  and  the  exact  function  from  one  million  MCMC  simulations. 


exact  d(  m) 


mean  GP  predictor 


mi  2.0002 

m2  2.0007 

m3  1.9305 

m4  1.9980 

ms  1.9996 

m6  1.9991 

m7  1.9970 

m8  1.9984 

m9  1.9979 

mio  2.0002 


1.9761 

1.8883 

1.8976 

1.9321 

1.9424 

1.9962 

1.9428 

1.9161 

1.9236 

1.8985 


7.4.  Inverse  shape  electromagnetic  scattering  example.  In  this  section, 
we  consider  two  dimensional  transverse  magnetic  (TM)  polarization  in  the  context 
of  electromagnetic  scattering  due  to  a  scatterer  in  the  free  space.  The  governing 
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equations  read 

dHx  8EZ  _ 
dt  dy 

_9Hv,dE*=n 

dt  dx 

Ez  =  El 

Hx  =  Hy  =  Ez  =  0 


in  Q  x  (0, T), 

in  Q  x  (0, T), 

in  dQs  x  (0  ,T), 
in  x  {0}  , 


where  Hx,Hy  and  denotes  the  x  —  y  components  of  the  magnetic  field  and  z 
component  of  the  electric  field  with  appropriate  normalization  [29],  respectively.  Here, 
E1  =  cos  (8  (t  —  x))  is  the  incident  electric  field,  and  fls  the  scatterer  satisfying 
Qs  C  H  C  M2. 

Next  denoting 


Hx  = 


-Hy 

Hx 


E  =  EZ , 


and  using  the  perfect  matched  layer  (PML)  proposed  in  [1],  the  above  TM  equations 
become 


where 


<9H± 
ft  + 

\7E  = 

=  L 

in 

Vi  x 

(0  ,T), 

(7.1a) 

dE  „ 
~dt  +  V' 

Hx  = 

=  M 

in 

X 

(0,0, 

(7.1b) 

dP  _s 

dt 

dQ 

dt 

=  R 

in 

X 

(0,0, 

(7.1c) 

E  = 

=  -E1 

in 

dtts 

X  (0,0, 

(7-ld) 

E  =  0, 

Hx  = 

=  0 

in 

X 

{0}, 

(7.1e) 

P  =  0, 

Q  = 

=  0 

in 

Q  x 

{0}, 

(7- If) 

p  = 

[PxXyf, 

L  = 

=  AHi+BP, 

S  =  DHx, 

Q  = 

[Qxi  Qy]  5 

R  = 

=  GQ  +  FHi, 

M  =  CtQ, 

-2(jx  0 

0  — 2cr  y  5 


D  = 


~(?x  0 

0  Gy 


B  = 


ox  0 

0  ~Gy 


G  = 


-ax  0 

0  -Gy 


C  = 

F  = 


1  0 

0  -1  ’ 


with  ax  and  ay ,  the  defining  damping  property  of  the  PML  layer,  given  by 


To  \x\  <  l 

ax  =  <  15(x  —  l)2  x  >  1 
[  15(x  +  l)2  x  <  —  1 


[  o  M<i 

^  =  <  i5(y  - 1)2  y  >  i 
[  15 (y  + 1)2  y  <  -l 


A  typical  truncated  domain  [—1,1] 2  (fine  mesh)  together  with  the  PML  domain 
(coarse  mesh)  is  shown  in  Figure  7.4.  The  object  in  the  middle  of  the  domain  is 
a  scatterer.  We  also  show  a  typical  scattered  electric  field  solution  in  Figure  7.5. 
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Fig.  7.4.  A  typical  mesh  of  the  PML  and  truncated  domains  together  with  a  scatterer. 


Ez 


-2-1  0  1  2 


x 


Fig.  7.5.  An  example  of  scattered  electric  field. 


The  forward  problem  can  be  stated  as  follows.  Given  a  scatterer’s  shape,  the  goal 
is  to  compute  the  scattered  fields,  in  particular,  at  the  observation  points  denoted  as 
small  circles  in  Figure  7.5. 

In  the  inverse  problem,  on  the  other  hand,  the  task  is  to  reconstruct  the  scatterer’s 
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shape,  Q,s,  given  scattered  field  data,  possibly  polluted  by  noises,  at  K  observation 
points.  It  is  not  necessary  to  have  the  scattered  data  for  all  the  fields,  but  we  assume 
it  is  so  for  convenience.  We  choose  to  solve  the  inverse  problem  statistically  using  a 
Bayesian  framework  whose  details  can  be  found  in  [15,35].  Assuming  i.i.d.  Gaussian 
noise  with  zero  mean  and  variance  a2  at  all  observation  points,  the  likelihood  model 
is  chosen  as 


THike  ^ 


where  quantities  with  superscript  “obs”  are  the  observed  data  and  x  =  [x,y]T.  The 
Dirac  delta  function  £(.)  is  defined  as 


*  =  f  1  X  =  Xfc, 

(x-xfc)  |  q  otherwise. 

We  begin  the  prior  modeling  by  defining  the  admissible  shape  space.  In  this 
paper,  the  shape  parametrization  is  restricted  as 


ns 

r  =  ml  cos  ([i  —  1  \0) 


where  (r,  0)  are  polar  coordinates  of  the  shape,  and  ml  the  ith  shape  parameter.  As¬ 
sume  a  priori  that  the  unknown  shape  is  smooth  so  that  the  following  spline  smoothing 
can  be  employed  [57]: 


T^prior  CX-  GXp 

The  solution  to  the  statistical  inverse  problem  is  the  following  posterior  density, 
after  ignoring  the  normalized  constant  (which  is  not  required  by  Markov  chain  Monte 
Carlo  methods), 


d  (m)  —  TTiike  X  7Tprior- 
Denote  J  =  —  logd  (m),  then  we  have 

^(x— Xfc)  dDdt 


One  of  the  key  ingredients  of  our  approach  is  the  Hessian  (or  its  approximation)  of 
J .  For  large-scale  PDE-based  inverse  problem  such  as  the  inverse  scattering  example 
considered  in  this  section,  efficient  method  for  computing  the  Hessian  is  vital  and  we 
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adopt  an  adjoint  approach  to  fulfill  this  goal.  To  begin,  we  form  the  Lagrangian 

C  =  j  +  J  J  h±  •  +  V£  -  AHj_  -  BP^j  dCldt 

+  J  |  e^+V-Hi-C^j  dQdt  +  J  J  p  ■  -  DH^  dQdt 

+  j  J  q  •  -  GQ  -  FH±j  dQdt  +  J  J  A  (E  +  E1)  dsdt 

+  h /  •  Hj_dQ  +  /  p /  •  Pdfl  +  /  q j  •  Qdf2  +  I  e/Edfl. 

Jn  Jn  Jn  Jn 

The  first  order  Karush-Kuhn- Tucker  optimality  system  can  be  derived  as  follows: 

•  Taking  the  first  variation  of  the  Lagrangian  with  respect  to  h^,e,p,q  and 
arguing  that  the  variations  of  h_L,  e,  p,  q  are  arbitrary  in  x  (0,  T)  yield  the 
forward  equations  (7.1a)-  (7.1c). 

•  Taking  the  first  variation  of  the  Lagrangian  with  respect  to  A  and  arguing 
that  the  variation  of  A  is  arbitrary  in  dfls  x  (0,T)  yield  the  forward  PEC 
condition  (7. Id). 

•  Taking  the  first  variation  of  the  Lagrangian  with  respect  to  e/,  h/,  p/,  q/  and 
arguing  that  the  variations  of  e/,  h/,  p/,  q j  are  arbitrary  in  x  {0}  yield  the 
forward  initial  conditions  (7.1e)-(7.1f). 

•  Taking  the  first  variation  of  the  Lagrangian  with  respect  to  Hi,  E ,  P,  Q  and 
arguing  that  the  variations  of  H P,  Q  are  arbitrary  in  the  corresponding 
domains  yield  the  following  adjoint  equations  together  with  the  final  and 


boundary 

conditions 

<9h_L  _  T  * 

at  +Ve  =  L 

in  Q  x  (0, T), 

(7.2a) 

l  +  vh  x=«' 

in  Q  x  (0, T), 

(7.2b) 

dp 

dt 

=  — BThj_,  f^  =  GTq-eC 
dt 

in  Q  x  (0, T), 

(7.2c) 

e  =  0 

in  dtts  x  (0, T), 

(7.2d) 

e  =  0,  =  0 

in  Q  x  {T}  , 

(7.2e) 

p  =  0,  q  =  0 

in  x  {T}  , 

(7.2f) 

where 

L*  =  (H±  -  Hffes)  <5(x_Xfc)  -  Arh±  -  DTp  -  FTq, 

k= 1 
K 

M*  =  J2(E-E°kbs)6(x-xky 

k=l 

Other  adjoint  variables  are  found  to  be 

A  =  —  h±  •  n  in  dfls  x  (0,  T), 
ej  =  e,  h/  =  h_L,  p/  =  p,  q/  =  q  infix  {0}  . 
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•  Taking  derivatives  of  the  Lagrangian  with  respect  to  the  shape  parameters 
ml  can  be  done  using  the  shape  gradient  and  Hessian  methods  as  in  [19]. 
Here,  we  assume  that  the  obstacle  is  star-like  around  the  origin  and  hence 
simpler  route  is  possible  (see  [12]  and  references  therein).  The  derivatives  of 
the  Lagrangian  with  respect  to  rri1  turns  out  to  be 

BC  f  f27T 

Qi  =  -—  =  -  l  /  [h±  •  V  (E  +  E1)}  r  cos  ([*  -  1]  9)  d9dt.  (7.3) 

am1  JT  J0 

The  reduced  gradient  computation  at  a  particular  shape  m  is  now  ready.  One 
first  solves  the  forward  system  (7.1a)-(7.1f)  for  the  forward  states  and  forward  PML 
variables.  The  adjoint  system  (7.2a)-(7.2f)  is  then  solved  for  the  adjoint  states  and 
adjoint  PML  variables.  The  reduced  shape  gradient  is  now  available  by  evaluating 
the  right  hand  side  of  equation  (7.3).  It  is  clear  that  one  forward  and  one  adjoint 
solves  are  needed  for  the  shape  gradient  computation. 

In  order  to  compute  the  product  of  the  shape  Hessian  and  a  vector  of  shape 
variation,  we  first  compute  the  forward  variation,  involving  one  incremental  forward 
(linearization  of  the  forward  equations)  solve,  and  then  the  adjoint  variation,  involving 
one  incremental  adjoint  (linearization  of  the  adjoint  equations)  solve,  corresponding 
to  that  shape  variation  vector.  The  shape  Hessian- vector  product  is  the  total  variation 
of  the  shape  gradient  (7.3).  More  specifically,  the  variation  Sr  due  to  variation  c)m  is 
given  by 

ns 

Sr  =  cos  ([^  —  1]0)  • 

2=1 


The  ith  component  of  the  product  of  the  shape  Hessian  and  vector  c)m  can  be  shown 
to  be 


n  pZ  7T 

SGi  =  —  /  V  [h±  •  V  (E  +  E1)]  •  er  cos  ([*  -  1]  9)  5 rdOdt 

Jt  Jo 

-  [  [  [h|_  •  V  (E  +  E1)]  Sr cos  ([?  -  1|  9)  dOdt 

Jt  Jo 

p  p27T 

-  [<5h±  •  V  (E  +  E1)]  r  cos  ([*  -  1]  9)  dOdt 
Jt  Jo 

-  [hx  •  VSE]  r  cos  ([i  —  1]  0) 

Jt  Jo 


dOdt, 


where  e  =  [cos  0, sin  9} T ,  and  the  variations  in  the  forward  states,  SE  and  <5Hx,  satisfy 
the  following  incremental  forward  equations 


cMHj 

Ft 

d5E 


dt 


d5V 


dt 


=  SS, 


SE  =  0, 


VSE  =  SE 

■  <5Hx  =  SM 

dSQ, 


dt 


=  <5R 


SE  =  -V  (E  +  E1)  ■  eSr 

<5H  i  =  0 


<5P  =  0,  5Q  =  0 


in  H  x  (0, T), 
in  H  x  (0, T), 

in  H  x  (0, T), 

in  dQs  x  (0,  T), 
in  Q  x  {0}  , 
in  H  x  {0}  , 
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where 


SP  =  [ SPX ,  5Py]T ,  SL  =  A<5Hj_  +  Bi5P, 

<SS  =  D(5Hj_, 

<5Q  =  [5QX,  SQy]T ,  JR  =  G<SQ  +  F<SH_l, 

SM  =  CtSQ. 

Similarly,  the  variations  in  the  adjoint  states,  Se  and  £h_L,  satisfy  the  following 
mental  adjoint  equations 

cMhx  _ .  rT  * 

— — -=■  +  VSe  =  SL* 
at 

in  9  x  (0,  T), 

^  +  V  •  <5h_L  =  SM* 
at 

in  9  x  (0,  T), 

^  =  — BT<5h±,  ^  =  GT<5q  -  Se C 

dt  ’  dt  ^ 

in  9  x  (0,  T), 

Se  -  —We  •  e  Sr 

in  dfls  x  (0>^)> 

e  =  0,  Jhx  =  0 

in  9  x  {T}, 

rip  =  0.  4q  =  0 

in  9  x  {T }  , 

where 


K 


*L*  =  £<SH±<S(x_Xlt) 

k= 1 
K 

SM*  =J2SES(x_Xk). 


ATSh±  -  BTSp  -  FT£q, 


k= 1 


During  the  Newton  iterations  in  which  the  shape  is  updated,  we  generate  a  new 
corresponding  mesh  for  the  (incremental)  forward  and  (incremental)  adjoint  solves. 
To  simplify  the  implementation  and  to  avoid  difficulties  for  the  mesh  generator,  we 
allow  the  shape  parameters  to  vary  only  in  the  hyper-rectangle  defined  by 


mL  =  0.15  [1,  -1,  -1/2, . . . ,  -1/2(^-x)]T  , 
m  u  =  [0.27, 0.15, 0.15/2,...,  0.15/2(^-1)]: 


(7.4) 


As  a  consequence,  we  fix  the  time  step  and  hence  avoid  interpolating  the  solutions  in 
time  during  the  optimization  process. 

We  use  a  2nd-order  nodal  discontinuous  Galerkin  method  [29]  for  spatial  dis¬ 
cretization  and  the  classical  4th-order  Runge-Kutta  for  temporal  discretization  of  the 
(incremental)  forward  and  (incremental)  adjoint  equations.  The  mesh  has  4,494  tri¬ 
angles  and  the  total  number  of  nodal  unknowns,  for  both  Hi  and  E,  is  80,892.  The 
observation  data  Eobs  and  are  synthesized  by  solving  the  forward  solver  T  =  n 
using  4474  time  steps.  The  exact  shape  that  we  would  like  to  invert  for  is  governed 
by 


x  =  cos(t)  +  0.65  cos(2t)  —  0.65,  y  =  1.5  sin(t),  f  E  [0,  2ir\. 

This  exact  shape  and  the  corresponding  electric  field  at  T  =  it  are  shown  in  Figure 
7.5.  For  convenience,  we  compute  the  observations  at  all  time  steps  and  add  an  i.i.d. 
zero  mean  Gaussian  noise  with  a  =  0.05.  The  regularization  parameter  k  is  chosen 
to  be  1/  (cr2T) . 
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For  the  first  greedy  cycle,  we  use  the  circle  with  radius  of  0.236  as  the  initial  guess 
for  the  optimization  solver.  For  other  greedy  cycles,  the  initial  guesses  are  computed, 
as  in  Section  6,  from  the  test  set  Ma  containing  10000  LHC  points  distributed  within 
the  bound  hil  <  m  <  rip  . 

Next,  we  use  DRAM,  an  efficient  MCMC  package  in  [28],  to  sample  our  Gaussian 
process  predictor  with  one  million  MCMC  simulations.  Figure  7.6  shows  the  result 
for  29  training  points  and  one  global  Gaussian  approximation  after  one  greedy  cycle. 
Figure  7.7  shows  the  result  for  62  training  points  and  six  local  Gaussian  approxima¬ 
tions  after  ten  greedy  cycles.  Here,  we  plot  the  sample  posterior  mean  and  its  99.99% 
credibility  envelope  together  with  the  exact  shape  and  the  deterministic  solution.  As 
can  be  seen,  the  posterior  mean  predicts  well  the  left  side  of  the  kite  but  worse  on 
the  right.  This  is  expected  since  the  incident  wave  is  from  left  to  right.  The  credi¬ 
bility  region  responses  similarly,  namely,  the  uncertainty  is  less  on  the  left  and  grows 
gradually  as  we  move  to  the  right.  Nevertheless,  the  uncertainty  is  large,  even  on 
the  left  side,  at  the  center  of  the  concave  part  of  the  kite.  This  is  anticipated  since 
non-convex  regions  are  not  easy  to  be  reconstructed  [38].  It  is  interesting  to  observe 
that  within  the  bounds  of  interest  (7.4),  the  approximations  with  one  and  ten  greedy 
cycles  are  not  very  different.  It  suggests  that  there  be  a  single  dominant  mode  of  the 
Bayesian  posterior  density  which  is  already  captured  by  the  first  greedy  cycle.  One 
can  also  see  that  the  sample  mean  is  almost  identical  to  the  deterministic  solution. 


Fig.  7.6.  The  sample  posterior  mean  from  one  million  MCMC  simulations  together  with  its 
99.99%  credibility  envelope  versus  the  exact  and  the  deterministic  solutions.  The  Gaussian  predic¬ 
tor  is  obtained  after  one  greedy  cycle  accounting  for  29  training  points  and  one  global  Gaussian 
approximation. 
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Fig.  7.7.  The  sample  posterior  mean  from  one  million  MCMC  simulations  together  with  its 
99.99%  credibility  envelope  versus  the  exact  and  the  deterministic  solutions.  The  Gaussian  pre¬ 
dictor  is  obtained  after  ten  greedy  cycle  accounting  for  62  training  points  and  one  global  Gaussian 
approximation. 

It  should  be  pointed  out  that  a  million  MCMC  simulations  is  chosen  randomly, 
but  it  turns  out  that  the  MCMC  already  converges.  To  see  this,  we  perform  two  mil¬ 
lion  MCMC  simulations  and  show  again  the  sample  posterior  mean  together  with  its 
99.99%  credibility  envelope  versus  the  exact  and  the  deterministic  solutions  in  Figures 
7.8  and  7.9.  The  results  look  almost  unchanged  compared  to  those  in  Figures  7.6  and 
7.7.  To  further  confirm  the  convergence,  we  plot  two  dimensional  marginal  chains 
for  the  Gaussian  process  predictor  obtained  after  ten  greedy  cycle  using  one  million 
MCMC  simulations  in  Figures  7.10  and  7.11,  and  two  million  MCMC  simulations  in 
Figures  7.12  and  7.13.  As  can  be  observed,  the  marginal  chains  (only  the  first  eight 
parameters  are  shown)  converge  and  look  almost  indifferent  for  both  MCMC  runs. 

To  see  the  efficiency  of  the  Gaussian  response  surface  method,  we  compare  the 
CPU  time  taken  for  two  million  MCMC  simulations  for  one  greedy  cycle  case  in 
Table  7.6.  The  offline  time  is  defined  as  the  time  taken  to  build  the  Gaussian  process 
response  surface.  It  turns  out  that  it  took  about  33  hours,  while  there  would  have 
been  no  offline  cost  if  we  had  used  the  exact  Bayesian  posterior  density.  However,  the 
offline  time  was  paid  off  when  we  performed  the  MCMC  simulations.  In  particular, 
the  MCMC  simulations  required  almost  one  hour  for  the  GP  response  surface,  but  it 
would  have  taken  528141  hours  if  we  had  used  the  exact  posterior  density! 

8.  Conclusions.  We  have  developed  an  adaptive  Hessian-based  non-stationary 
Gaussian  process  response  surface  method  to  approximate  a  probability  density  func- 
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Table  7.6 

CPU  time  taken  for  two  million  MCMC  simulations. 


Gaussian  process  predictor 

exact  posterior  density 

offline  time 

33  hours 

0  hours 

online  time 

0.96  hours 

528141.75  hours 

tion  (pdf)  that  exploits  its  structure,  in  particular  the  Hessian  of  its  negative  loga¬ 
rithm.  Of  particular  interest  to  us  are  expensive-to-evaluate  pdfs,  e.g.,  those  arising 
from  the  Bayesian  solution  of  large-scale  inverse  problems.  Our  method  can  be  con¬ 
sidered  as  a  piecewise  adaptive  Gaussian  approximation  in  which  a  Gaussian  tailored 
to  the  local  Hessian  of  the  negative  log  probability  density  is  constructed  for  each 
sub-region  in  high  dimensional  parameter  space.  The  task  of  efficiently  partition¬ 
ing  the  parameter  space  into  sub-regions  is  done  implicitly  through  Hessian-informed 
membership  probability  functions.  The  Gaussian  process  machinery  is  then  employed 
to  glue  all  local  Gaussian  approximations  into  a  global  analytical  response  surface 
that  is  far  cheaper  to  evaluate  than  the  original  expensive  probability  density.  The 
resulting  response  surface  is  also  equipped  with  an  analytical  variance  estimate  that 
can  be  used  to  assess  the  uncertainty  of  the  approximation.  One  of  the  key  com- 


Fig.  7.8.  The  sample  posterior  mean  from  two  million  MCMC  simulations  together  with  its 
99.99%  credibility  envelope  versus  the  exact  and  the  deterministic  solutions.  The  Gaussian  predic¬ 
tor  is  obtained  after  one  greedy  cycle  accounting  for  29  training  points  and  one  global  Gaussian 
approximation. 
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ponents  of  our  proposed  approach  is  an  adaptive  sampling  strategy  for  exploring  the 
parameter  space  efficiently  during  the  computer  experimental  design  step,  which  aims 
to  find  training  points  with  high  probability  density.  The  detailed  construction  and 
an  analysis  of  the  method  have  been  presented.  We  have  demonstrated  the  accuracy 
and  efficiency  of  the  proposed  method  on  several  example  problems,  including  inverse 
shape  electromagnetic  scattering  in  24-dimensional  parameter  space. 

Ongoing  research  aims  to  address  the  following: 

1.  The  Gaussian  process  predictor  is  not  guaranteed  to  be  non- negative  every¬ 
where  (even  though  it  seems  to  be  the  case  for  all  examples  considered  in  this 
paper).  How  to  enforce  the  positiveness  of  the  predictor  in  our  framework 
remains  an  open  question,  though  one  may  use  an  approach  proposed  in  [52]. 

2.  Rigorous  analysis  of  the  quality  of  the  Gaussian  process  predictor  is  clearly 
an  important  direction  for  future  work. 

3.  The  size  of  the  random  set  Ma  is  quite  arbitrary.  Intuitively,  the  larger  it  is, 
the  better  our  predictor.  Thus,  the  question  that  needs  to  be  addressed  is 
how  to  choose  Ma  as  a  function  of  the  dimension  of  the  parameter  space. 

4.  Constructing  the  full  Hessian  for  local  Gaussian  approximation  is  prohibitively 
expensive  for  practical  inverse  problems  in  high  dimensional  parameter  spaces. 
However,  often,  one  can  make  accurate  low  rank  approximations  of  the  Hes- 
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Fig.  7.9.  The  sample  posterior  mean  from  two  million  MCMC  simulations  together  with  its 
99.99%  credibility  envelope  versus  the  exact  and  the  deterministic  solutions.  The  Gaussian  pre¬ 
dictor  is  obtained  after  ten  greedy  cycle  accounting  for  62  training  points  and  one  global  Gaussian 
approximation. 
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sian  for  ill-posed  inverse  problem;  scalable  algorithms  can  be  constructed  for 
this  task  [22]  and  theoretical  justification  of  the  compactness  of  the  Hessian 
can  be  provided  in  certain  cases  [12, 13]. 

5.  We  have  concentrated  on  the  detailed  development  and  analysis  of  the  pro¬ 
posed  method,  and  on  its  verification  for  several  examples.  We  have  compared 
our  approach  only  to  the  popular  adaptive  radial  basis  function  method.  Our 
future  work  should  carry  out  extensive  comparisons  with  other  existing  meth¬ 
ods  as  well. 

6.  The  adaptive  Hessian-based  non-stationary  Gaussian  process  response  surface 
method  is  designed  for  problems  involving  thousands  of  parameters  or  more. 
Ongoing  research  is  to  apply  the  method  to  such  high-dimensional  problems 
as  well. 

Acknowledgements.  We  thank  Lucas  Wilcox  for  many  fruitful  discussions,  and 
Zhengdong  Lu  for  valuable  suggestion  on  the  soft-max  function  in  [9] .  We  also  thank 
James  Martin  for  many  useful  discussions  on  MCMC  methods  and  their  convergence. 
This  research  was  supported  by  AFOSR  grant  FA9550-09-1-0608;  DOE  grants  DE- 
SC0002710,  DE-FC52-08NA28615,  and  DEFC02-06ER25782;  and  NSF  grants  CMS- 
1028889,  OPP-0941678,  DMS-0724746,  and  CMS-0619078. 

REFERENCES 


-0.04 


0.2  0.22  0.24  0.26  -0.1  -0.08  -0.06  -0.04  -0.1  -0.05  0 
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